Abstract 

The free energy functional has recently been proposed as a variational principle for bounded 
rational decision-making, since it instantiates a natural trade-off between utility gains and 
information processing costs that can be axiomatically derived. Here we apply the free 
energy principle to general decision trees that include both adversarial and stochastic 
environments. We derive generalized sequential optimality equations that not only include 
the Bellman optimality equations as a limit case, but also lead to well-known decision-rules 
such as Expectimax, Minimax and Expectiminimax. We show how these decision-rules can 
be derived from a single free energy principle that assigns a resource parameter to each 
node in the decision tree. These resource parameters express a concrete computational 
cost that can be measured as the number of samples that are needed from the distribution 
that belongs to each node. The free energy principle therefore provides the normative 
basis for generalized optimality equations that account for both adversarial and stochastic 
environments. 

Keywords: Foundations of AI, free energy, Bellman optimality equations, bounded rationality. 

1. Introduction 

Decision trees are a ubiquitous tool in decision theory and artificial intelligence research to 
represent a wide range of decision-making problems that include the classic reinforcement 
learning paradigm as well as competitive games (Osborne and Rubinstein, 1999; Russell and 
Norvig, 2010). Depending on the kind of system one is interacting with, there are different 
decision rules one has to apply — the most famous ones being Expectimax, Minimax and 
Expectiminimax — see Figure 1. When an agent interacts with a stochastic system, the 
agent chooses its decisions based on Expectimax. Essentially, Expectimax is the dynamic 
programming algorithm that solves the Bellman optimality equations, thereby recursively 
maximizing expected future reward in a sequential decision problem (Bellman, 1957). 

In two-player zero-sum games where strictly competitive players make alternate moves, 
an agent should use the Minimax strategy. The motivation underlying minimax decisions is 
that the agent wants to optimize the worst-case gain as a means of protecting itself against 
the potentially harmful decisions made by the adversary. Finally, there are games that mix 
the two previous interaction types. For instance, in Backgammon, the course of the game 
depends on the skill of the players and chance elements. In these cases, the agent bases its 
decisions on the Expectiminimax rule (Michie, 1966). 



[Figure 1 panels, left to right: Expectimax, Minimax, Expectiminimax] 

Figure 1: Illustration of Expectimax, Minimax and Expectiminimax in decision trees 
representing three different interaction scenarios. The internal nodes can be of three 
possible types: maximum (△), minimum (▽) and expectation (○). The optimal 
decision is calculated recursively using dynamic programming. 

What is common to all of these decision-making schemes is that they presuppose a fully 
rational decision-maker that is able to compute all of the required operations with absolute 
precision. In contrast, a bounded rational decision-maker trades off expected utility gains 
against the cost of the required computations (Simon, 1984). Recently, the free energy 
has been suggested as a normative variational principle for such bounded rational decision- 
making that takes the computational effort into account (Ortega and Braun, 2011; Braun 
and Ortega, 2011; Ortega, 2011). This builds on previous work on efficient computation of 
optimal actions that trades off the benefits obtained from maximizing the utility function 
against the cost of changing the uncontrolled dynamics given by the environment (Kappen, 
2005; Todorov, 2006, 2009; Kappen et al., 2012). The aim of this paper is to extend these 
results to generalized decision trees such that Expectimax, Minimax, Expectiminimax, and 
bounded rational acting can all be derived from a single optimization principle. Moreover, 
this framework leads to a natural measure of computational costs spent at each node of the 
decision tree. All the proofs are given in the appendix. 

2. Free Energy 

2.1. Equilibrium Distribution 

In Ortega and Braun (2011) and in Ortega (2011) it was shown that a bounded rational 
decision-making problem can be formalized based on the negative free energy difference 
between two information processing states represented by two probability distributions P 
and Q. The decision process then transforms the initial choice probability Q into a final 
choice probability P by taking into account the utility gains (or losses) and the 
transformation costs. This transformation process can be formalized as 

$$P(x) = \frac{1}{Z} Q(x) e^{\alpha U(x)}, \qquad \text{where } Z = \sum_{x} Q(x) e^{\alpha U(x)}. \tag{1}$$

Accordingly, the choice pattern of the decision-maker is predicted by the equilibrium 
distribution P. Crucially, the probability distribution P extremizes the following functional 
(Callen, 1985; Keller, 1998): 

Definition 1 (Negative Free Energy Difference) Let Q be a probability distribution 
and let U be a real-valued utility function over the set $\mathcal{X}$. For any $\alpha \in \mathbb{R}$, define the 
negative free energy difference $F_\alpha[P]$ as 

$$F_\alpha[P] := \sum_{x} P(x) U(x) - \frac{1}{\alpha} \sum_{x} P(x) \log \frac{P(x)}{Q(x)}. \tag{2}$$

The parameter $\alpha$ is called the inverse temperature. 

Although, strictly speaking, the functional $F_\alpha[P]$ corresponds to the negative free energy 
difference, we will refer to it as the "free energy" in the following for simplicity. When 
inserting the equilibrium distribution (1) into (2), the extremum of $F_\alpha$ yields: 

$$\frac{1}{\alpha} \log Z = \frac{1}{\alpha} \log \Big( \sum_{x} Q(x) e^{\alpha U(x)} \Big). \tag{3}$$

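For different values of $\alpha$, this extremum takes the following limits: 

$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \log Z = \max_{x} U(x) \qquad \text{(maximum node)}$$

$$\lim_{\alpha \to 0} \frac{1}{\alpha} \log Z = \sum_{x} Q(x) U(x) \qquad \text{(chance node)}$$

$$\lim_{\alpha \to -\infty} \frac{1}{\alpha} \log Z = \min_{x} U(x) \qquad \text{(minimum node)}$$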

The case $\alpha \to \infty$ corresponds to the perfectly rational agent, the case $\alpha \to 0$ corresponds to 
the expectation at a chance node and the case $\alpha \to -\infty$ anticipates the perfectly rational 
opponent. Therefore, the single expression $\frac{1}{\alpha} \log Z$ can represent the maximum, expectation 
and minimum depending on the value of $\alpha$. 
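
The three limits can be checked numerically. The following is a minimal sketch, not part of the original derivation: the distribution `q`, the utilities `u` and the function name `free_energy_extremum` are hypothetical choices made for illustration, and the case of an inverse temperature of exactly zero is handled by returning the expectation directly.

```python
import math

def free_energy_extremum(q, u, alpha):
    """Return (1/alpha) * log(sum_x q[x] * exp(alpha * u[x])), computed stably."""
    if alpha == 0:
        # The alpha -> 0 limit is the expectation of u under q.
        return sum(qi * ui for qi, ui in zip(q, u))
    # Shift by the largest exponent so that exp() never overflows.
    shift = max(alpha * ui for ui in u)
    total = sum(qi * math.exp(alpha * ui - shift) for qi, ui in zip(q, u))
    return (shift + math.log(total)) / alpha

q = [0.2, 0.5, 0.3]    # hypothetical reference distribution Q
u = [1.0, 3.0, -2.0]   # hypothetical utilities U

for alpha in (-200.0, -1.0, 1e-6, 1.0, 200.0):
    print(alpha, free_energy_extremum(q, u, alpha))
```

For a large positive inverse temperature the printed value approaches the largest utility, for a large negative one it approaches the smallest, and near zero it approaches the expected utility under `q`.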

The inspection of (2) reveals that the free energy encapsulates a fundamental decision- 
theoretic trade-off: it corresponds to the expected utility, penalized — or regularized — by the 
information cost of transforming the base distribution Q into the final distribution P. The 
inverse temperature plays the role of the conversion factor between units of information and 
units of utility. 
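
The trade-off can be made concrete with a small numerical experiment. The sketch below is illustrative only: the numbers and helper names are assumptions, and the inverse temperature is taken to be positive so that the extremum is a maximum. It evaluates the free energy (2) at the equilibrium distribution (1) and at randomly drawn distributions.

```python
import math
import random

def free_energy(p, q, u, alpha):
    """F_alpha[P] = sum_x P(x) U(x) - (1/alpha) sum_x P(x) log(P(x)/Q(x))."""
    expected_utility = sum(pi * ui for pi, ui in zip(p, u))
    information_cost = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return expected_utility - information_cost / alpha

def equilibrium(q, u, alpha):
    """Equilibrium distribution P(x) = Q(x) exp(alpha U(x)) / Z, returned with Z."""
    weights = [qi * math.exp(alpha * ui) for qi, ui in zip(q, u)]
    z = sum(weights)
    return [w / z for w in weights], z

q = [0.2, 0.5, 0.3]    # hypothetical reference distribution Q
u = [1.0, 3.0, -2.0]   # hypothetical utilities U
alpha = 0.7            # hypothetical positive inverse temperature

p_star, z = equilibrium(q, u, alpha)
f_star = free_energy(p_star, q, u, alpha)

rng = random.Random(0)

def random_distribution(n):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# No randomly drawn distribution should beat the equilibrium distribution.
best_random = max(free_energy(random_distribution(3), q, u, alpha) for _ in range(1000))
print(f_star, math.log(z) / alpha, best_random)
```

The first two printed numbers agree, matching (3), and the third stays below them because any other distribution pays either in expected utility or in information cost.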

If we want to change the inverse temperature $\alpha$ to $\beta$ while keeping the equilibrium and reference 
distributions equal, then we need to change the corresponding utilities from U to V in a 
manner given by the following theorem. Temperature changes will be important for the 
application of the free energy principle to the general decision trees in Section 3. 
Theorem 2 Let P be the equilibrium distribution for a given inverse temperature $\alpha$, utility 
function U and reference distribution Q. If the temperature changes to $\beta$ while keeping P 
and Q fixed, then the utility function changes to 

$$V(x) = U(x) - \Big( \frac{1}{\alpha} - \frac{1}{\beta} \Big) \log \frac{P(x)}{Q(x)}.$$
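
A quick numerical check of this utility transformation is sketched below. All numbers and function names are hypothetical; the code computes the equilibrium distribution at one inverse temperature, transforms the utilities as in Theorem 2, and recomputes the equilibrium at the new inverse temperature.

```python
import math

def boltzmann(q, u, alpha):
    """Equilibrium distribution Q(x) exp(alpha U(x)) / Z."""
    weights = [qi * math.exp(alpha * ui) for qi, ui in zip(q, u)]
    z = sum(weights)
    return [w / z for w in weights]

def transformed_utility(p, q, u, alpha, beta):
    """V(x) = U(x) - (1/alpha - 1/beta) * log(P(x) / Q(x))."""
    factor = 1.0 / alpha - 1.0 / beta
    return [ui - factor * math.log(pi / qi) for pi, qi, ui in zip(p, q, u)]

q = [0.2, 0.5, 0.3]       # hypothetical reference distribution Q
u = [1.0, 3.0, -2.0]      # hypothetical utilities U
alpha, beta = 0.7, 2.5    # hypothetical old and new inverse temperatures

p = boltzmann(q, u, alpha)
v = transformed_utility(p, q, u, alpha, beta)
p_new = boltzmann(q, v, beta)

max_gap = max(abs(a - b) for a, b in zip(p, p_new))
print(max_gap)   # numerically zero: the equilibrium distribution is preserved
```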



2.2. Resource Costs 



Consider the problem of picking the largest number in a sequence $U_1, U_2, U_3, \ldots$ of i.i.d. 
data, where each $U_i \in \mathcal{U}$ is drawn from a source with probability distribution M. After $\alpha$ 
draws the largest number will be given by $\max\{U_1, U_2, \ldots, U_\alpha\}$. Naturally, the larger the 
number of draws, the higher the chances of observing a large number. 

Theorem 3 Let $\mathcal{X}$ be a finite set. Let Q and M be strictly positive probability distributions 
over $\mathcal{X}$. Let $\alpha$ be a positive integer. Define $M_\alpha$ as the probability distribution over the 
maximum of $\alpha$ samples from M. Then, there are strictly positive constants $\delta$ and $\xi$ depending 
only on M such that for all $\alpha$, 

$$\Big| \frac{Q(x) e^{\alpha U(x)}}{\sum_{x'} Q(x') e^{\alpha U(x')}} - M_\alpha(x) \Big| \leq e^{-(\alpha - \xi)\delta}.$$



Consequently, one can interpret the inverse temperature as a resource parameter that 
determines how many samples are drawn to estimate the maximum. Note that the 
distribution M is arbitrary as long as it has the same support as Q. This interpretation can 
be extended to a negative $\alpha$, by noting that $\alpha U(x) = (-\alpha)(-U(x))$, i.e. instead of the 
maximum we take the minimum of $-\alpha$ samples. 
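
The sampling interpretation can be illustrated by comparing the two distributions directly. The sketch below makes several assumptions that are not in the text: a three-element set with distinct utilities, uniform Q and M, and hypothetical helper names. It computes the exact distribution of the best of several i.i.d. draws from M and measures its largest pointwise distance to the Boltzmann distribution.

```python
import math

def boltzmann(q, u, alpha):
    """Q(x) exp(alpha U(x)) / Z, computed with a shift for numerical stability."""
    shift = max(alpha * ui for ui in u)
    weights = [qi * math.exp(alpha * ui - shift) for qi, ui in zip(q, u)]
    z = sum(weights)
    return [w / z for w in weights]

def max_of_samples(m, u, alpha):
    """Exact distribution of the highest-utility element among alpha i.i.d. draws from m.
    Assumes all utilities are distinct."""
    order = sorted(range(len(u)), key=lambda i: u[i])   # indices by increasing utility
    dist = [0.0] * len(u)
    cumulative, cdf_prev = 0.0, 0.0
    for i in order:
        cumulative += m[i]
        cdf = min(cumulative, 1.0)
        # P(max = x_n) = F(x_n)^alpha - F(x_{n-1})^alpha
        dist[i] = cdf ** alpha - cdf_prev ** alpha
        cdf_prev = cdf
    return dist

def gap(q, m, u, alpha):
    """Largest pointwise difference between the two distributions."""
    b = boltzmann(q, u, alpha)
    s = max_of_samples(m, u, alpha)
    return max(abs(x - y) for x, y in zip(b, s))

u = [0.0, 1.0, 2.0]          # hypothetical utilities
q = [1 / 3, 1 / 3, 1 / 3]    # hypothetical reference distribution Q
m = [1 / 3, 1 / 3, 1 / 3]    # hypothetical sampling distribution M

gaps = {alpha: gap(q, m, u, alpha) for alpha in (1, 5, 20)}
print(gaps)
```

In this toy setting the gap shrinks as the number of samples grows, since both distributions concentrate on the element with the highest utility.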



3. General Decision Trees 

A generalized decision tree is a tree where each node corresponds to a possible interaction 
history $x_{\leq t} \in \mathcal{X}^t$, where t is less than or equal to some fixed horizon T, and where edges 
connect two consecutive interaction histories. Furthermore, every node $x_{\leq t}$ has an 
associated inverse temperature $\beta(x_{\leq t})$; and every transition has a base probability $Q(x_t|x_{<t})$ of 
moving from state $x_{<t}$ to state $x_{\leq t} = x_{<t} x_t$, representing the stochastic law the interactions 
follow when they are not controlled, and an immediate reward $R(x_t|x_{<t})$. The objective of the 
agent is to make decisions such that the sum $\sum_{t=1}^{T} R(x_t|x_{<t})$ is maximized subject to the 
temperature constraints. 



3.1. Free Energy for General Decision Trees 

The free energy principle is stated above for one decision variable x. If x represents a tuple 
of (possibly dependent) random variables $x_1, \ldots, x_T$, then the free energy principle can 
be applied in a straightforward manner to the corresponding tree. However, all nodes of 
the tree will have the same inverse temperature assigned to them and, therefore, the same 
amount of computational resources will be spent at each node of the tree. This allows, for 
example, deriving the formalisms of path integral control and KL control (Todorov, 2009; 
Braun and Ortega, 2011; Kappen et al., 2012). 

Figure 2: The free energy formalism can only be applied in a straightforward manner to 
trees with uniform resource allocation (left). In order to apply it to general trees 
that have different resource parameters at each node (right), we need to transform 
the utilities as described in (4) to preserve the equilibrium distribution. 

In the case of general decision trees the assumption of uniform temperatures has to be 
relaxed (Figure 2). In general, we can then dedicate different amounts of computational 
resources to each node of the tree. However, this requires a translation between a tree 
with a single temperature and a tree with different temperatures. This translation can 
be achieved using Theorem 2. Define a reward as the change in utility between two subsequent 
nodes. Then, the rewards of the resulting decision tree are given by 

$$R(x_t|x_{<t}) := [V(x_{\leq t}) - V(x_{<t})] = [U(x_{\leq t}) - U(x_{<t})] - \Big( \frac{1}{\alpha} - \frac{1}{\beta(x_{<t})} \Big) \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})}. \tag{4}$$

This allows introducing a collection of node-specific (not necessarily time-specific) inverse 
temperatures $\beta(x_{<t})$, allowing for a greater degree of flexibility in the representation of 
information costs. The next theorem states the connection between the free energy and the 
general decision tree formulation. 

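Theorem 4 The free energy of the whole trajectory can be rewritten in terms of rewards: 

$$F_\alpha[P] = U(\varepsilon) + \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \Big\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} \Big\}. \tag{5}$$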

This translation allows applying the free energy principle to each node with a different 
resource parameter $\beta(x_{<t})$. By writing out the sum in (5), one realizes that this free energy 
has a nested structure where the latest time step forms the innermost variational problem 
and all other variational problems of the previous time steps can be solved recursively by 
working backwards in time. This then leads to the following solution: 
Theorem 5 The solution to the free energy in terms of rewards is given by 

$$P(x_t|x_{<t}) = \frac{1}{Z(x_{<t})} Q(x_t|x_{<t}) \exp\Big\{ \beta(x_{<t}) \Big[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \Big] \Big\},$$

where $Z(x_{\leq T}) = 1$ and where for all $t \leq T$ 

$$Z(x_{<t}) = \sum_{x_t} Q(x_t|x_{<t}) \exp\Big\{ \beta(x_{<t}) \Big[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \Big] \Big\}.$$
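
The backward recursion of Theorem 5 can be sketched on a small tree. The code below is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `solve` function and the example tree are hypothetical, and every internal node is assumed to have a finite, nonzero inverse temperature.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    beta: float = 1.0   # inverse temperature of this node (unused at leaves)
    # Each branch is (base probability Q, immediate reward R, child node).
    branches: List[Tuple[float, float, "Node"]] = field(default_factory=list)

def solve(node: Node) -> Tuple[float, List[float]]:
    """Return ((1/beta) log Z, policy over branches) for one node.
    Assumes a finite, nonzero beta at every internal node."""
    if not node.branches:
        return 0.0, []   # Z = 1 at the horizon, so (1/beta) log Z = 0
    exponents = []
    for _, reward, child in node.branches:
        child_value, _ = solve(child)   # (1/beta(child)) log Z(child)
        exponents.append(node.beta * (reward + child_value))
    shift = max(exponents)              # log-sum-exp shift for stability
    weights = [q * math.exp(e - shift)
               for (q, _, _), e in zip(node.branches, exponents)]
    z_shifted = sum(weights)
    value = (shift + math.log(z_shifted)) / node.beta
    policy = [w / z_shifted for w in weights]
    return value, policy

# Hypothetical example: the root chooses between a gamble and a sure reward.
gamble = Node(beta=0.1, branches=[(0.5, 1.0, Node()), (0.5, -1.0, Node())])
root = Node(beta=2.0, branches=[(0.5, 0.0, gamble), (0.5, 0.2, Node())])

root_value, root_policy = solve(root)
print(root_value, root_policy)
```

With these numbers the root puts more probability on the sure reward, because the gamble node has a small inverse temperature and is therefore valued close to its expected reward of zero.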



3.2. Generalized Optimality Equations 



Theorem 5 together with the properties of the free energy extremum (3) suggest the following 
definition. 

Definition 6 (Generalized Optimality Equations) 

$$V(x_{<t}) = \frac{1}{\beta(x_{<t})} \log \Big\{ \sum_{x_t} Q(x_t|x_{<t}) \exp\big\{ \beta(x_{<t}) \big[ R(x_t|x_{<t}) + V(x_{\leq t}) \big] \big\} \Big\}.$$

By virtue of our previous analysis, this equation tells us how to recursively calculate the 
value function (i.e. the utility of each node) given the computational resources allocated in 
each node. 

It is immediately clear that the three kinds of decision trees mentioned in the 
introduction are special cases of general decision trees. In particular, the three classical operators 
are obtained as limit cases: 

$$V(x_{<t}) = \begin{cases} \max_{x_t} \{ R(x_t|x_{<t}) + V(x_{\leq t}) \} & \text{if } \beta(x_{<t}) = \infty, \\ \mathbf{E} \{ R(x_t|x_{<t}) + V(x_{\leq t}) \} & \text{if } \beta(x_{<t}) = 0, \\ \min_{x_t} \{ R(x_t|x_{<t}) + V(x_{\leq t}) \} & \text{if } \beta(x_{<t}) = -\infty. \end{cases}$$



The familiar Bellman optimality equations for stochastic systems are obtained by 
considering an agent decision node followed by a random decision node: 

$$V(x_{<t}) = \max_{x_t} \big\{ R(x_t|x_{<t}) + V(x_{\leq t}) \big\} = \max_{x_t} \big\{ R(x_t|x_{<t}) + \mathbf{E} \big[ R(x_{t+1}|x_{\leq t}) + V(x_{\leq t+1}) \big] \big\}.$$
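
How the generalized equations specialize to the Bellman backup can be sketched with a two-level toy problem. Everything below is a hypothetical illustration: the `backup` and `value` functions, the `actions` table and the uniform base distribution over actions are assumptions, and the three limit cases are handled explicitly instead of numerically.

```python
import math

def backup(q, xs, beta):
    """(1/beta) log sum_i q[i] exp(beta * xs[i]), with explicit limit cases.
    Assumes every q[i] is strictly positive."""
    if beta == math.inf:
        return max(xs)
    if beta == -math.inf:
        return min(xs)
    if beta == 0:
        return sum(qi * xi for qi, xi in zip(q, xs))
    shift = max(beta * xi for xi in xs)
    total = sum(qi * math.exp(beta * xi - shift) for qi, xi in zip(q, xs))
    return (shift + math.log(total)) / beta

# Hypothetical problem: (action reward, outcome probabilities, outcome rewards).
actions = [
    (0.0, [0.5, 0.5], [4.0, 0.0]),
    (1.0, [0.9, 0.1], [1.0, 2.0]),
]

def value(agent_beta, environment_beta):
    """Root value: an agent node followed by one environment node per action."""
    action_values = [reward + backup(probs, outcomes, environment_beta)
                     for reward, probs, outcomes in actions]
    uniform = [1.0 / len(actions)] * len(actions)   # assumed base distribution
    return backup(uniform, action_values, agent_beta)

# Classical Bellman backup, written out directly for comparison.
bellman = max(reward + sum(p * r for p, r in zip(probs, outcomes))
              for reward, probs, outcomes in actions)

print(value(math.inf, 0.0), bellman)   # Expectimax limit and the Bellman value
print(value(math.inf, -math.inf))      # Minimax limit: adversarial environment
print(value(50.0, 0.0))                # bounded agent, close to the Bellman value
```

Assigning an infinite inverse temperature to the agent node and zero to the environment nodes reproduces the Bellman value, while a finite agent temperature gives a slightly smaller value in this example.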



4. Discussions & Conclusions 

Bounded rational decision-making schemes based on the free energy generalize classic decision- 
making schemes by taking into account information processing costs measured by the 
Kullback-Leibler divergence (Wolpert, 2004; Todorov, 2009; Peters et al., 2010; Ortega 
and Braun, 2011; Kappen et al., 2012). Ultimately, these costs are determined by Lagrange 
multiplier constraints given by the inverse temperature playing the role of a resource 
parameter. Here we generalize this approach to general decision trees where each node can have 
a different resource allocation. Consequently, we obtain generalized optimality equations 
for sequential decision-making that include the well-known Bellman optimality equation as 
well as Expectimax-, Minimax- and Expectiminimax-decision rules depending on the limit 
values of the resource parameters. The resource parameters themselves are amenable to 
interesting computational, statistical and economic interpretations. In the first sense they 
measure the number of samples needed from a distribution before applying the max operator 
and therefore correspond directly to computational effort. In the second sense they reflect 
the confidence of the estimate of the maximum and therefore they can also express risk 
attitudes. Finally, the resource parameters reflect the control an agent has over a random 
variable. These different ramifications need to be explored further in the future. 



Appendix A. Proofs 

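A.1. Proof of Theorem 2 

Proof Since the equilibrium and reference distributions P(x) and Q(x) are constant but 
arbitrarily chosen, it must be that 

$$U(x) - \frac{1}{\alpha} \log \frac{P(x)}{Q(x)} = V(x) - \frac{1}{\beta} \log \frac{P(x)}{Q(x)}.$$

Hence, 

$$V(x) = U(x) - \Big( \frac{1}{\alpha} - \frac{1}{\beta} \Big) \log \frac{P(x)}{Q(x)}.$$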



A.2. Proof of Theorem 3 

Proof Let $x_1, x_2, \ldots, x_N$ be the ordering of $\mathcal{X}$ such that $U(x_1) \leq U(x_2) \leq \ldots \leq U(x_N)$. It is well 
known that the distribution over the maximum of $\alpha$ samples is equal to $F_\alpha(x) = F(x)^\alpha$, where 
F is the cumulative distribution $F(x_n) = \sum_{k \leq n} M(x_k)$. Defining $F(x_0) := 0$, one has 
$M_\alpha(x_n) = F(x_n)^\alpha - F(x_{n-1})^\alpha$. Hence, the probability can be bounded as $0 \leq M_\alpha(x_n) \leq F(x_n)^\alpha$, or 

$$0 \leq M_\alpha(x_n) \leq e^{-\alpha \gamma(x_n)}, \tag{6}$$

if we use $F(x_n) = e^{-\gamma(x_n)}$, where $\gamma(x_n) \geq 0$. The Boltzmann distribution can be bounded as 

$$0 \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{\sum_k Q(x_k) e^{\alpha U(x_k)}} \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{Q(x_N) e^{\alpha U(x_N)}}.$$

The upper bound is obtained by dropping all the summands in the expectation but the largest. In 
exponential form, the bounds are written as 

$$0 \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{\sum_k Q(x_k) e^{\alpha U(x_k)}} \leq e^{-\alpha \delta(x_n) + c(x_n)}, \tag{7}$$

where $\delta(x_n) := U(x_N) - U(x_n)$ and $c(x_n) := -\log Q(x_N) + \log Q(x_n)$. Note that $\delta(x_n)$ is positive. 
Subtracting the inequalities (6) from (7) yields 

$$-e^{-\alpha \gamma(x_n)} \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{\sum_k Q(x_k) e^{\alpha U(x_k)}} - M_\alpha(x_n) \leq e^{-\alpha \delta(x_n) + c(x_n)}.$$

Choosing $\xi(x_n) = c(x_n)/\delta(x_n) \geq 0$ allows rewriting the upper bound and changing the lower bound 
to 

$$-e^{-(\alpha - \xi(x_n)) \gamma(x_n)} \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{\sum_k Q(x_k) e^{\alpha U(x_k)}} - M_\alpha(x_n) \leq e^{-(\alpha - \xi(x_n)) \delta(x_n)}.$$

Finally, choosing $\xi := \max_n \{\xi(x_n)\}$ and $\delta := \max\{\max_n \{\delta(x_n)\}, \max_n \{\gamma(x_n)\}\}$ yields the bounds 
of the theorem 

$$-e^{-(\alpha - \xi)\delta} \leq \frac{Q(x_n) e^{\alpha U(x_n)}}{\sum_k Q(x_k) e^{\alpha U(x_k)}} - M_\alpha(x_n) \leq e^{-(\alpha - \xi)\delta}.$$



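A.3. Proof of Theorem 4 

Proof The free energy of the whole trajectory with inverse temperature $\alpha$ is given by 

$$\sum_{x_{\leq T}} P(x_{\leq T}) \Big\{ U(x_{\leq T}) - \frac{1}{\alpha} \log \frac{P(x_{\leq T})}{Q(x_{\leq T})} \Big\}.$$

Using a telescopic sum $\sum_{t=1}^{T} (a_t - a_{t-1}) = a_T - a_0$ for the utilities, and the factorization 
of P and Q into conditional probabilities for the log-ratio, yields 

$$U(\varepsilon) + \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \Big\{ [U(x_{\leq t}) - U(x_{<t})] - \frac{1}{\alpha} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} \Big\}.$$

Using the definition of rewards (4), one gets the result 

$$U(\varepsilon) + \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \Big\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} \Big\}.$$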



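A.4. Proof of Theorem 5 

Proof The inner sum of the free energy 

$$U(\varepsilon) + \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \Big\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} \Big\}$$

can be expanded as 

$$\sum_{x_1} P(x_1) \Big\{ R(x_1) - \frac{1}{\beta(\varepsilon)} \log \frac{P(x_1)}{Q(x_1)} + \sum_{x_2} P(x_2|x_1) \Big\{ R(x_2|x_1) - \frac{1}{\beta(x_1)} \log \frac{P(x_2|x_1)}{Q(x_2|x_1)} + \cdots + \sum_{x_T} P(x_T|x_{<T}) \Big\{ R(x_T|x_{<T}) - \frac{1}{\beta(x_{<T})} \log \frac{P(x_T|x_{<T})}{Q(x_T|x_{<T})} \Big\} \cdots \Big\} \Big\}.$$

This can be solved by induction, starting with the innermost sums and then recursively solving the 
outer sums. The innermost sums 

$$\sum_{x_T} P(x_T|x_{<T}) \Big\{ R(x_T|x_{<T}) - \frac{1}{\beta(x_{<T})} \log \frac{P(x_T|x_{<T})}{Q(x_T|x_{<T})} \Big\}$$

are maximized when 

$$P(x_T|x_{<T}) = \frac{1}{Z(x_{<T})} Q(x_T|x_{<T}) \exp\big\{ \beta(x_{<T}) R(x_T|x_{<T}) \big\}.$$

This can be seen by noting that for probabilities $p_i$ and positive numbers $r_i > 0$, the quantity 
$\sum_i p_i \log(p_i / r_i)$ is minimized by choosing $p_i = \frac{1}{Z} r_i$, where $Z = \sum_i r_i$ is just a normalizing constant. 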
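
The minimization claim can be verified with a short calculation that uses only the non-negativity of the Kullback-Leibler divergence. The normalized weights $\bar{r}_i$ below are notation introduced here for illustration, and the second display assumes a positive inverse temperature $\beta(x_{<T})$.

```latex
\begin{align*}
\sum_i p_i \log \frac{p_i}{r_i}
  &= \sum_i p_i \log \frac{p_i}{\bar{r}_i} - \log Z,
  \qquad \bar{r}_i := \frac{r_i}{Z}, \quad Z := \sum_i r_i, \\
  &\geq -\log Z,
  \qquad \text{with equality if and only if } p_i = \bar{r}_i \text{ for all } i.
\end{align*}

% Applied to the innermost sum with r = Q(x_T|x_{<T}) exp{beta(x_{<T}) R(x_T|x_{<T})}:
\begin{align*}
\sum_{x_T} P(x_T|x_{<T}) \Big\{ R(x_T|x_{<T})
    - \frac{1}{\beta(x_{<T})} \log \frac{P(x_T|x_{<T})}{Q(x_T|x_{<T})} \Big\}
  &= -\frac{1}{\beta(x_{<T})} \sum_{x_T} P(x_T|x_{<T})
     \log \frac{P(x_T|x_{<T})}{Q(x_T|x_{<T}) \, e^{\beta(x_{<T}) R(x_T|x_{<T})}} \\
  &\leq \frac{1}{\beta(x_{<T})} \log Z(x_{<T}).
\end{align*}
```

The bound is attained at the stated solution, so the maximized innermost sum equals $\frac{1}{\beta(x_{<T})} \log Z(x_{<T})$, which is the term carried into the next outer sum.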

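Substituting this solution yields the outer sums 

$$\sum_{x_t} P(x_t|x_{<t}) \Big\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \Big\},$$

where 

$$Z(x_{<t}) = \sum_{x_t} Q(x_t|x_{<t}) \exp\Big\{ \beta(x_{<t}) \Big[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \Big] \Big\}.$$

These sums are then maximized by choosing 

$$P(x_t|x_{<t}) = \frac{1}{Z(x_{<t})} Q(x_t|x_{<t}) \exp\Big\{ \beta(x_{<t}) \Big[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \Big] \Big\}.$$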